
KV handoff with DMA slicing APIs to avoid KV input/output copies. #1039

Open
quic-akuruvil wants to merge 4 commits into quic:release/v1.22.0_tmp from quic-akuruvil:dma_slice


@quic-akuruvil quic-akuruvil commented Jun 4, 2026

Contributor

Problem

Without DMA slicing, the QPC in disaggregated serving expects the KV cache for every batch slot as input. For example, if decode runs with BS=32 and 4 of those batch slots become free, the QPC and LRT still expect KV caches for all 32 slots. DMA buffer slicing fixes this: the user slices the DMA buffer into N batch slots and writes the KV cache for each slot individually by indexing that slot.
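The slot-indexed write described above can be sketched with plain NumPy. This is only an illustration of the indexing idea, not the QAIC runtime API: the buffer is an ordinary host array standing in for a DMA buffer, and the shapes and the helpers `slot_byte_range` and `write_slot` are hypothetical.

```python
import numpy as np

# Hypothetical shapes for illustration only; real KV layouts depend on the model and QPC.
BATCH_SLOTS, KV_HEADS, CTX_LEN, HEAD_DIM = 32, 2, 16, 8

# One contiguous host array standing in for the decode-side DMA buffer of one KV tensor.
kv_buffer = np.zeros((BATCH_SLOTS, KV_HEADS, CTX_LEN, HEAD_DIM), dtype=np.float16)


def slot_byte_range(slot: int) -> tuple[int, int]:
    """Return (offset, size) in bytes of one batch slot inside the buffer."""
    slot_bytes = kv_buffer[0].nbytes
    return slot * slot_bytes, slot_bytes


def write_slot(slot: int, kv: np.ndarray) -> None:
    """Overwrite only the given batch slot; every other slot is left untouched."""
    kv_buffer[slot] = kv


# Only the freed slots receive new KV data; the other 28 slots are not rewritten.
freed_slots = [3, 7, 12, 30]
for s in freed_slots:
    write_slot(s, np.full((KV_HEADS, CTX_LEN, HEAD_DIM), s, dtype=np.float16))
```

A sliced descriptor would presumably carry something like the `(offset, size)` pair per slot, so that only the bytes of the refreshed slots need to be addressed.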

Idea

Disaggregated serving pipeline on QAIC with zero‑copy KV cache handoff.
Prefill to decode KV transfer happens through host (shared memory).
Shared memory is used so that there's no copy of KV cache when transferring from prefill to host.
The KV cache is dumped from the prefill devices to shared memory on the host, and a pointer to that shared memory is passed to the decode instance, which loads the KV cache directly from those host buffers.
This is useful in the disaggregated setting for any large KV footprint. Because DMA buffer slicing is used, the KV cache no longer has to be passed as inputs between the prefill and decode sessions.
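The handoff-by-pointer idea can be sketched with Python's standard `multiprocessing.shared_memory` module. This is an analogy only: the PR uses QAIC DMA buffers rather than this module, both "sides" run in one process here for brevity, and the KV shape is hypothetical.

```python
import numpy as np
from multiprocessing import shared_memory

SHAPE, DTYPE = (2, 16, 8), np.float16  # hypothetical KV shape for one request
nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize

# "Prefill" side: allocate host shared memory and write the KV cache into it.
shm = shared_memory.SharedMemory(create=True, size=nbytes)
producer_view = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
producer_view[:] = 1.5

# Only the block name (a short string) is handed to the "decode" side.
handle = shm.name

# "Decode" side: attach by name and read the same memory without copying it.
attached = shared_memory.SharedMemory(name=handle)
consumer_view = np.ndarray(SHAPE, dtype=DTYPE, buffer=attached.buf)
same_memory = bool(consumer_view[0, 0, 0] == 1.5)

# A later write through one view is visible through the other, confirming no copy was made.
producer_view[0, 0, 0] = 2.0
sees_update = bool(consumer_view[0, 0, 0] == 2.0)

# Drop the array views before closing, then free the block.
del producer_view, consumer_view
attached.close()
shm.close()
shm.unlink()
```

What crosses the prefill/decode boundary in this sketch is a handle, not the tensor data, which is the property the zero-copy handoff relies on.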

Optimization

Adds a new temporary QAICInferenceSession class (cloud_infer_kv_slice.py) that enables zero-copy KV-cache handoff between disaggregated prefill and decode sessions using shared DMA buffers and the QAICRT API setDataWithSlices(). On the last prefill chunk, KV outputs are wired directly into the decode session's input slots via a sliced DMA descriptor, which eliminates the Python/numpy copy at the prefill→decode boundary.

cluster_id="prefill" gives a pool of stages+1 slots for concurrent chunk pipelining; cluster_id="decode" gives a single fixed slot because decode is strictly sequential.

Enables true prefill/decode overlap (exec-obj pool)

Existing method: uses a single QAICInferenceSession with one exec-obj. The CPU must call waitForCompletion() (blocking) before it can read KV outputs and set up the next call. Prefill and decode are strictly serialized.

KV slice method: uses separate cluster_id="prefill" and cluster_id="decode" sessions with exec-obj pools. setDataWithSlices() is called before enqueue, so the runtime knows where to write KV outputs before inference starts. This means:

  • A new prefill request can be enqueued on a free prefill exec-obj while decode is still running on its exec-obj
  • The prefill pool (stages+1 exec-objs) allows pipelined chunked prefill without stalling on waitForCompletion() between chunks
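The pool behaviour in the two points above can be sketched as a small free-list. This is a toy model, not the session class added in this PR: `ExecObjPool` and its methods are hypothetical names, and exec-objs are represented by plain integers.

```python
import queue


class ExecObjPool:
    """Hypothetical stand-in for a pool of execution objects, tracked by index."""

    def __init__(self, size: int) -> None:
        self._free = queue.Queue()
        for idx in range(size):
            self._free.put(idx)

    def try_acquire(self):
        """Return a free exec-obj index, or None if every exec-obj is in flight."""
        try:
            return self._free.get_nowait()
        except queue.Empty:
            return None

    def release(self, idx: int) -> None:
        """Mark an exec-obj as free again once its inference has completed."""
        self._free.put(idx)


stages = 2
prefill_pool = ExecObjPool(stages + 1)  # stages+1 exec-objs for chunk pipelining
decode_pool = ExecObjPool(1)            # single fixed slot: decode is sequential

# A decode step is in flight on the decode exec-obj...
decode_obj = decode_pool.try_acquire()

# ...while three prefill chunks are enqueued back to back without waiting in between.
in_flight = [prefill_pool.try_acquire() for _ in range(stages + 1)]

# A fourth chunk has to wait until the oldest chunk completes and frees its exec-obj.
pool_exhausted = prefill_pool.try_acquire() is None
prefill_pool.release(in_flight.pop(0))
reused = prefill_pool.try_acquire()
```

Because the prefill and decode pools are separate, holding the decode exec-obj never blocks a prefill acquire; the only wait in this model happens when all stages+1 prefill exec-objs are busy.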

Example Script

Also adds an end-to-end example (examples/disagg_serving/qwen3moe_disagg_mode_with_chunking_kvslice.py) demonstrating the full disaggregated serving flow for Qwen3-MoE with chunked prefill, PP (stages), TS, and DMA-sliced KV handoff.

quic-akuruvil and others added 4 commits June 9, 2026 15:05
Signed-off-by: Ann <quic_akuruvil@quicinc.com>
This PR adds MDP generation required for disaggregated serving for Prefill.
Supports both Pipeline Prefill + Tensor Slicing and passing custom cores
to the MDP generator. Also adds support for VLMs, compiler 'stages' option,
and layerwise export.

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Ann <quic_akuruvil@quicinc.com>
Signed-off-by: Ann <quic_akuruvil@quicinc.com>
3 participants